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Abstract: We consider a class of learning problems that involve a structured sparsity-inducing 
norm defined as the sum of ^oo-norms over groups of variables. Whereas a lot of effort has been 
put in developing fast optimization methods when the groups are disjoint or embedded in a specific 
hierarchical structure, we address here the case of general overlapping groups. To this end, we show 
that the corresponding optimization problem is related to network flow optimization. More precisely, 
the proximal problem associated with the norm we consider is dual to a quadratic min-cost flow 
problem. We propose an efficient procedure which computes its solution exactly in polynomial time. 
Our algorithm scales up to millions of variables, and opens up a whole new range of applications for 
structured sparse models. We present several experiments on image and video data, demonstrating 
the applicability and scalability of our approach for various problems. 

Key-words: network flow optimization, convex optimization, sparse methods, proximal algo- 
rithms 
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Algorithmes de Flots pour Parcimonie Structuree 



Resume : Nous considerons une classe de problemes d'apprentissage regularises par une norme in- 
duisant de la parcimonie structuree, definie comme une somme de normes sur des groupes de vari- 
ables. Alors que de nombreux efforts ont etes mis pour developper des algorithmes d'optimisation 
rapides lorsque les groupes sont disjoints ou structures hierarchiquement, nous nous interessons au 
cas general de groupes avec recouvrement. Nous montrons que le probleme d'optimisation corre- 
spondant est lie a l'optimisation de flots sur un reseau. Plus precisement, l'operateur proximal 
associe a la norme que nous considerons est dual a la minimisation d'un cout quadratique de not 
sur un graphe particulier. Nous proposons une procedure efficace qui calcule cette solution en un 
temps polynomial. Notre algorithme peut traiter de larges problemes, comportant des millions de 
variables, et ouvre de nouveaux champs d'applications pour les modeles parcimonieux structures. 
Nous presentons diverses experiences sur des donnees d'images et de videos, qui demontrent l'utilite 
et l'efficacite de notre approche pour resoudre de nombreux problemes. 

Mots-cles : optimisation de flots, optimisation convexe, methodes parcimonieuses, algorithmes 
proximaux 
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1 Introduction 

Sparse linear models have become a popular framework for dealing with various unsupervised and 
supervised tasks in machine learning and signal processing. In such models, linear combinations of 
small sets of variables are selected to describe the data. Regularization by the £i-norm has emerged 
as a powerful tool for addressing this combinatorial variable selection problem, relying on both a 
well-developed theory (see pQ and references therein) and efficient algorithms [2J 0, H] • 

The ^i-norm primarily encourages sparse solutions, regardless of the potential structural rela- 
tionships (e.g., spatial, temporal or hierarchical) existing between the variables. Much effort has 
recently been devoted to designing sparsity-inducing regularizations capable of encoding higher- 
order information about allowed patterns of non-zero coefficients [HI O El El [H]> with successful 
applications in bioinformatics [51 IIP], topic modeling [11] and computer vision [8]. 

By considering sums of norms of appropriate subsets, or groups, of variables, these regularizations 
control the sparsity patterns of the solutions. The underlying optimization problem is usually 
difficult, in part because it involves nonsmooth components. Proximal methods have proven to 
be effective in this context, essentially because of their fast convergence rates and their ability to 
deal with large problems [3J 13] . While the settings where the penalized groups of variables do not 
overlap [T^] or are embedded in a tree-shaped hierarchy [TT] have already been studied, sparsity- 
inducing regularizations of general overlapping groups have, to the best of our knowledge, never 
been considered within the proximal method framework. 

This paper makes the following contributions: 

• It shows that the proximal operator associated with the structured norm we consider can be 
computed by solving a quadratic min-cost flow problem, thereby establishing a connection 
with the network flow optimization literature. 

• It presents a fast and scalable procedure for solving a large class of structured sparse reg- 
ularized problems, which, to the best of our knowledge, have not been addressed efficiently 
before. 

• It shows that the dual norm of the sparsity-inducing norm we consider can also be evalu- 
ated efficiently, which enables us to compute duality gaps for the corresponding optimization 
problems. 

• It demonstrates that our method is relevant for various applications, from video background 
subtraction to estimation of hierarchical structures for dictionary learning of natural im- 
age patches. 



2 Structured Sparse Models 

We consider in this paper convex optimization problems of the form 

min /(w) + AQ(w), (1) 

where / : R p — »■ R is a convex differentiable function and fi : R p — > R is a convex, nonsmooth, 
sparsity-inducing regularization function. When one knows a priori that the solutions of this 
learning problem only have a few non-zero coefficients, f2 is often chosen to be the £i-norm, leading 
for instance to the Lasso [T3J. When these coefficients are organized in groups, a penalty encoding 
explicitly this prior knowledge can improve the prediction performance and/or interpretability of 
the learned models [121 IT4l [T5l [T6] . Such a penalty might for example take the form 

°( w ) - y^maxlwil = T^VgW^gWoo, (2) 
geG y geg 



RR n° 7372 



Network Flow Algorithms for Structured Sparsity 



4 



where Q is a set of groups of indices, Wj denotes the j-th coordinate of w for j in [l;p] = {1, . . . ,p}, 
the vector w g in R' 9 represents the coefficients of w indexed by g in Q, and the scalars r\ g are 
positive weights. A sum of 4-norms is also used in the literature [7], but the £oo-norm is piecewise 
linear, a property that we take advantage of in this paper. Note that when Q is the set of singletons 
of we get back the fi-nonn. 

If Q is a more general partition of [l;p], variables are selected in groups rather than individually. 
When the groups overlap, O is still a norm and sets groups of variables to zero together [S]. The 
latter setting has first been considered for hierarchies [TJ [TD1 [T7j , and then extended to general group 
structures Solving Eq. (|TJ) in this context becomes challenging and is the topic of this paper. 
Following [IT] who tackled the case of hierarchical groups, we propose to approach this problem 
with proximal methods, which we now introduce. 

2.1 Proximal Methods 

In a nutshell, proximal methods can be seen as a natural extension of gradient-based techniques, 
and they are well suited to minimizing the sum / + Af2 of two convex terms, a smooth function / 
— continuously differentiable with Lipschitz-continuous gradient — and a potentially non-smooth 
function Ail (see [TS] and references therein). At each iteration, the function / is linearized at the 
current estimate Wo and the so-called proximal problem has to be solved: 

min /(w ) + (w - w n ) T V/(w ) + Afi(w) + ^||w - w ||2- 

The quadratic term keeps the solution in a neighborhood where the current linear approximation 
holds, and L > is an upper bound on the Lipschitz constant of V/. This problem can be 
rewritten as 

min - Nil — w||2 + A'n(w), (3) 

with A' = X/L, and u = wp — -^V/(wo). We call proximal operator associated with the regulariza- 
tion X'ft the function that maps a vector u in R p onto the (unique, by strong convexity) solution w* 
of Eq. ©. Simple proximal method use w* as the next iterate, but accelerated variants [3l |4] are 
also based on the proximal operator and require to solve problem ([3]) exactly and efficiently to enjoy 
their fast convergence rates. Note that when SI is the ^i-norm, the solution of Eq. © is obtained 
by a soft-thresholding |18| . 

The approach we develop in the rest of this paper extends to the case of general overlapping 
groups when Jl is a weighted sum of ^-norms, broadening the application of these regularizations 
to a wider spectrum of problems 

3 A Quadratic Min- Cost Flow Formulation 

In this section, we show that a convex dual of problem ([3]) for general overlapping groups Q can 
be reformulated as a quadratic min-cost flow problem. We propose an efficient algorithm to solve it 
exactly, as well as a related algorithm to compute the dual norm of f2. We start by considering the 
dual formulation to problem ([3]) introduced in for the case where fi is a sum of -norms: 



1 Note that other types of structured sparse models have also been introduced, either through a different norm [6], 
or through non-convex criteria [8] [9] . 

2 For hierarchies, the approach of 1111 applies also to the case of where J! is a weighted sum of fc-norms. 
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Lemma 1 (Dual of the proximal problem [llj) 

Given u in R p , consider the problem 

min i||u-Vril2 s.t V 5 €0, ||Hi<A»/ fl and £f </ j • //. (4) 

where £ = (£ 9 ) g eg is m RP x l e l, and denotes the j-th coordinate of the vector £ 9 . Then, every 
solution = (£* 9 ) g eg of Eq. satisfies w* = u— £* 9 > where w* is the solution of Eq. (0). 

Without loss of generality!^ we assume from now on that the scalars u., are all non-negative, and we 
constrain the entries of £ to be non- negative. We now introduce a graph modeling of problem Q. 

3.1 Graph Model 

Let G be a directed graph G = (V, E, s, t), where V is a set of vertices, E C V x V a, set of arcs, s 
a source, and t a sink. Let c and c' be two functions on the arcs, c : E — >■ M. and c' : E — > K + , where 
c is a cost function and c' is a non-negative capacity function. A flow is a non-negative function 
on arcs that satisfies capacity constraints on all arcs (the value of the flow on an arc is less than or 
equal to the arc capacity) and conservation constraints on all vertices (the sum of incoming flows 
at a vertex is equal to the sum of outgoing flows) except for the source and the sink. 

We introduce a canonical graph G associated with our optimization problem, and uniquely 
characterized by the following construction: 

(i) V is the union of two sets of vertices V u and V gr , where V u contains exactly one vertex for 
each index j in [l;p], and V gr contains exactly one vertex for each group g in Q. We thus have 
|V| = \Q\ + p. For simplicity, we identify groups and indices with the vertices of the graph. 

(ii) For every group g in Q, E contains an arc (s,g). These arcs have capacity Xrj g and zero cost. 

(iii) For every group g in Q, and every index j in g, E contains an arc (<?,j) with zero cost and 
infinite capacity. We denote by £| the flow on this arc. 

(iv) For every index j in [l;p], E contains an arc (j, t) with infinite capacity and a cost Cj = |(u.j — £») 2 , 
where ^ is the flow on (j, t). Note that by flow conservation, we necessarily have ^ =X) g eg £f 



^ „„ „„.. „„ . ~j >;) ^ggy-Sj- 

Examples of canonical graphs are given in Figures l(a)|(c) The flows £| associated with G 



can now be identified with the variables of problem indeed, the sum of the costs on the edges 
leading to the sink is equal to the objective function of (j4]), while the capacities of the arcs (s,g) 
match the constraints on each group. This shows that finding a flow minimizing the sum of the 
costs on such a graph is equivalent to solving problem (j3J). 

When some groups are included in others, the canonical graph can be simplified to yield a graph 
with a smaller number of edges. Specifically, if h and g are groups with h C g, the edges (g,j) for 
j 6 h carrying a flow £ 9 can be removed and replaced by a single edge (g, h) of infinite capacity 
and zero cost, carrying the flow X^efc^j' This simplification is illustrated in Figure 1(d) with a 



graph equivalent to the one of Figure 1 (c) This does not change the optimal value of £ , which is 
the quantity of interest for computing the optimal primal variable w*. We present in Appendix 1X1 
a formal definition of equivalent graphs. These simplifications are useful in practice, since they 
reduce the number of edges in the graph and improve the speed of the algorithms we are now going 
to present. 



3 



Let ^* denote a solution of Eq. (3). Optimality conditions of Eq. (3) derived in show that for all j in [l;p], 



the signs of the non-zero coefficients £* 9 for g in Q are the same as the signs of the entries Uj . To solve Eq. HI 



en it- 



can therefore flip the signs of the negative variables Uj , then solve the modified dual formulation (with non- negative 
variables), which gives the magnitude of the entries £* 9 (the signs of these being known). 
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A % 



£2 +£3 < x Vh 




(a) e = {ff = {l,2,3}}. 



(b) e = { fl = {l,2},h={2,3}}. 




|2+«3<A% 




(c) = {g = {l,2,3},/ l = {2,3}}. 



(d) g = {g = {l}Uh,h = {2,3}}. 



Figure 1: Graph representation of simple proximal problems with different group structures Q. The 
three indices 1,2,3 are represented as grey squares, and the groups g, h in Q as red discs. The 
source is linked to every group g, h with respective maximum capacity Xrjg, \rjh and zero cost. Each 
variable Uj is linked to the sink t, with an infinite capacity, and with a cost Cj = |(u.j — £j) 2 ■ All 
other arcs in the graph have zero cost and infinite capacity. They represent inclusion relations in- 



between groups, and between groups and variables. The graphs (c) and | (d) | correspond to a special 
case of tree-structured hierarchy in the sense of [TT]. Their min-cost flow problems are equivalent. 
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3.2 Computation of the Proximal Operator 

Quadratic min-cost flow problems have been well studied in the operations research literature [19J . 



One of the simplest cases, where Q contains a single group g as in Figure 1(a) can be solved by an 
orthogonal projection on the €i-ball of radius Xr] g . It has been shown, both in machine learning [20J 
and operations research JTSJ , that such a projection can be done in 0(p) operations. When the 



group structure is a tree as in Figure 1(d) strategies developed in the two communities are also 
similar [TT1 119| . and solve the problem in 0(pd) operations, where d is the depth of the tree. 

The general case of overlapping groups is more difficult. Hochbaum and Hong have shown in |19) 
that quadratic min-cost flow problems can be reduced to a specific parametric max-flow problem, 
for which an efficient algorithm exists [22 Q While this approach could be used to solve Eq. (Q}, 
it ignores the fact that our graphs have non-zero costs only on edges leading to the sink. To take 
advantage of this specificity, we propose the dedicated Algorithm [TJ Our method clearly shares 
some similarities with a simplified version of [22J presented in |23j . namely a divide and conquer 
strategy. Nonetheless, we performed an empirical comparison described in Appendix [D] which 
shows that our dedicated algorithm has significantly better performance in practice. Informally, 



Algorithm 1 Computation of the proximal operator for overlapping groups, 
l: Inputs: u £ M p , a set of groups Q, positive weights (%) g eg, and A (regularization parameter). 

2: Build the initial graph Go = (Vo,Eo, s,t) as explained in Section [XU 

3: Compute the optimal flow: £ <— computeFlow(Vb, Eq). 

4: Return: w = u £ (optimal solution of the proximal problem). 

Function computeFlow(V^ = V u U V gr , E) 

1: Projection step: 7 <- argmin^ Ej eVu U n i - 1 'if S - L EjeK T? ^ X ^g&v gr %■ 
2: For all nodes j in V U) set 7- to be the capacity of the arc (j,t). 
3: Max-flow step: Update {£j)jev n by computing a max-flow on the graph (V,E, s,t). 
4: if 3 j e V u s.t. j£ then 

5: Denote by (s, V + ) and (V~,t) the two disjoint subsets of (V, s, t) separated by the minimum 
(s, t)-cut of the graph, and remove the arcs between V + and V~ . Call E + and E~ the two 
remaining disjoint subsets of E corresponding to V + and V~ . 

6: (tj)jev+ <~ computeFlow(V"+,i;+). 

7: (£j)jeVv *~ compnteFJ-ow^ - , E~). 
8: end if 

9: Return: (£j)jev u - 



computeFlow(Vo, Eq) returns the optimal flow vector £, proceeding as follows: This function first 
solves a relaxed version of problem Eq. (U) obtained by replacing the sum of the vectors £ 9 by a 
single vector 7 whose ^i-norm should be less than, or equal to, the sum of the constraints on the 
vectors £ 9 . The optimal vector 7 therefore gives a lower bound ||u — 7||f/2 on the optimal cost. 
Then, the maximum-flow step [24J tries to find a feasible flow such that the vector £ matches 7. 
If £ = 7, then the cost of the flow reaches the lower bound, and the flow is optimal. If £ ^ 7, 
the lower bound cannot be reached, and we construct a minimum (s, £)-cut of the graph |25) that 
defines two disjoints sets of nodes V + and V~; V + is the part of the graph that can potentially 
receive more flow from the source, whereas all arcs linking s to V~ are saturated. The properties 
of a min (s, £)-cut [55] imply that there are no arcs from V + to V~ (arcs inside V have infinite 

4 By definition, a parametric max-flow problem consists in solving, for every value of a parameter, a max-flow 
problem on a graph whose arc capacities depend on this parameter. 
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capacity by construction), and that there is no flow on arcs from V~ to V + . At this point, it 
is possible to show that the value of the optimal min-cost flow on these arcs is also zero. Thus, 
removing them yields an equivalent optimization problem, which can be decomposed into two 
independent problems of smaller size and solved recursively by the calls to computeFlow(V + , E + ) 
and computeFlow(V~, E~). Note that when fi is the ^i-norm, our algorithm solves problem ^ 
during the first projection step in line 1 and stops. A formal proof of correctness of Algorithm Q] 
and further details are relegated to Appendix [B] 

The approach of 1!) ;22] is guaranteed to have the same worst-case complexity as a single max- 
flow algorithm. However, we have experimentally observed a significant discrepancy between the 
worst case and empirical complexities for these flow problems, essentially because the empirical cost 
of each max-flow is significantly smaller than its theoretical cost. Despite the fact that the worst- 
case guarantee of our algorithm is weaker than their (up to a factor |V|), it is more adapted to the 
structure of our graphs and has proven to be much faster in our experiments (see supplementary 
material) . 

Some implementation details are crucial to the efficiency of the algorithm: 

• Exploiting maximal connected components: When there exists no arc between two 
subsets of V, it is possible to process them independently to solve the global min-cost 
flow problem. To that effect, before calling the function computeFlow(y, E), we look for 
maximal connected components (V±, E±), . . . , (Vn, En) and call sequentially the procedure 
computeFlow(V;, Ei) for i in [l;iV]. 

• Efficient max-flow algorithm: We have implemented the "push-relabel" algorithm of |24) 
to solve our max-flow problems, using classical heuristics that significantly speed it up in 
practice (see |24[ 127) ). Our implementation uses the so-called "highest- active vertex selection 
rule, global and gap heuristics" (see [241127) ). and has a worst-case complexity of 0(| V j 2 1-E7I 1 / 2 ) 
for a graph (V,E,s,t). This algorithm leverages the concept of pre- flow that relaxes the 
definition of flow and allows vertices to have a positive excess. 

• Using flow warm-restarts: Our algorithm can be initialized with any valid pre-flow, en- 
abling warm-restarts when the max-flow is called several times as in our algorithm. 

• Improved projection step: The first line of the procedure computeFlow can be replaced by 
7 <- argmiHy^.g^ \{n 3 - 7j ) 2 s.t. T, je v u 7j < X ]2 g ev gr % and | 7j | < XY, g3 j Vg- The 
idea is that the structure of the graph will not allow ^ to be greater than A Ylgsj Vg after 
the max-flow step. Adding these additional constraints leads to better performance when the 
graph is not well balanced. This modified projection step can still be computed in linear 
time [21]. 

3.3 Computation of the Dual Norm 

The dual norm O* of f2, defined for any vector k in MP by Q*(k) = rriaxnr z )<i z t k, is a key quantity 
to study sparsity-inducing regularizations [5] [HI [25]. We use it here to monitor the convergence of 
the proximal method through a duality gap, and define a proper optimality criterion for problem (fT]). 
We denote by /* the Fenchel conjugate of / [2§], defined by = sup z [z T K — f(z)}. The duality 

gap for problem ([T]) can be derived from standard Fenchel duality arguments [2§] and it is equal 
to /(w) + Af2(w) + /*(— k) for w, k in K p with f2*(/c) < A. Therefore, evaluating the duality gap 
requires to compute efficiently J7* in order to find a feasible dual variable k. This is equivalent to 
solving another network flow problem, based on the following variational formulation: 

0*(k) = min t s.t. = n, and Vg £ Q, U 9 \\i < TT] q with £?=0ifj^.g. (5) 

CcRpxISI ' 

5 geQ 
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In the network problem associated with (|12p . the capacities on the arcs (s, g), g £ Q, are set to Trig, 
and the capacities on the arcs (j,t), j in [l;p], are fixed to Kj. Solving problem (|12p amounts to 
finding the smallest value of r, such that there exists a flow saturating the capacities Kj on the arcs 
leading to the sink t (i.e., £ = k). Equration (fT2"]) and the algorithm below are proven to be correct 
in Appendix [Bl 



Algorithm 2 Computation of the dual norm, 
l: Inputs: k G MP, a set of groups Q, positive weights {rj g ) g ^g. 
2: Build the initial graph Go = (Vq, Eq, s,t) as explained in Section [5751 
3: t i — dualNorm(Vo, E ). 
4: Return: r (value of the dual norm). 

Function dualNorm(l/ = K U V sr , £) 

l: r ^(X)jGV K i)/(S ff ey Vg) an d se t the capacities of arcs (s,g) to tt7 s for all g in V 9r . 

2: Max-flow step: Update (£j-)jev« by computing a max-flow on the graph (V,E,s,t). 

3: if 3 j £ s.t. £ ■ ^ Kj then 

4: Define (U+,£+) and (V~,E~) as in Algorithm □ and set r <- dualNorm(U~, E~). 

5: end if 

6: Return: r. 



4 Applications and Experiments 

Our experiments use the algorithm of [5] based on our proximal operator, with weights r/ g set to 1. 
We present this algorithm in more details in Appendix [C] 

4.1 Speed Comparison 

We compare our method (ProxFlow) and two generic optimization techniques, namely a subgradient 
descent (SG) and an interior point methodjj on a regularized linear regression problem. Both 
SG and ProxFlow are implemented in C++. Experiments are run on a single-core 2.8 GHz CPU. 
We consider a design matrix X in R raxp built from overcomplete dictionaries of discrete cosine 
transforms (DCT), which are naturally organized on one- or two-dimensional grids and display local 
correlations. The following families of groups Q using this spatial information are thus considered: 
(1) every contiguous sequence of length 3 for the one-dimensional case, and (2) every 3 x 3-square 
in the two-dimensional setting. We generate vectors y in 1" according to the linear model y = 
Xwq+£, where e ~ Af(0, 0.01||Xwo||2)- The vector Wo has about 20% percent nonzero components, 
randomly selected, while respecting the structure of Q, and uniformly generated between [—1, 1]. 

In our experiments, the regularization parameter A is chosen to achieve this level of sparsity. 
For SG, we take the step size to be equal to a/{k + b), where k is the iteration number, and 
(a, b) are the best parameters selected in {10 -3 , . . . , 10} x {10 2 , 10 3 , 10 4 }. For the interior point 
methods, since problem ([T]) can be cast either as a quadratic (QP) or as a conic program (CP), 
we show in Figure [5] the results for both formulations. Our approach compares favorably with the 
other methods, on three problems of different sizes, (n,p) G {(100, 10 3 ), (1024, 10 4 ), (1024, 10 5 )}, see 
Figure [5J In addition, note that QP, CP and SG do not obtain sparse solutions, whereas ProxFlow 
does. We have also run ProxFlow and SG on a larger dataset with (n,p) = (100, 10 6 ): after 12 
hours, ProxFlow and SG have reached a relative duality gap of 0.0006 and 0.02 respectively!^ 

5 In our simulations, we use the commercial software Mosek, http://www.mosek.com/ 
6 Due to the computational burden, QP and CP could not be run on every problem. 
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n=1 00, p=1 000, one-dimensional DCT _ n=1 024, p=1 0000, two-dimensional DCT _ n=1 024, p=1 00000, one-dimensional DCT 




log(CPU time) in seconds log(CPU time) in seconds log(CPU time) in seconds 



Figure 2: Speed comparisons: distance to the optimal primal value versus CPU time (log-log scale). 
Due to the computational burden. QP and CP could not be run on every problem. 

4.2 Background Subtraction 

Following [Bj, we consider a background subtraction task. Given a sequence of frames from a 
fixed camera, we try to segment out foreground objects in a new image. If we denote by y 6 M™ 
this image composed of n pixels, we model y as a sparse linear combination of p other images 
X G IR rl><p , plus an error term e in R n , i.e., y « Xw + e for some sparse vector w in W. This 
approach is reminiscent of [30J in the context of face recognition, where e is further made sparse 
to deal with small occlusions. The term Xw accounts for background parts present in both y 
and X, while e contains specific, or foreground, objects in y. The resulting optimization problem is 
nrm w ,e i||y— Xw— e||| + Ai ||w||i + A2||e||i, with Ai, A2 > 0. In this formulation, the £i-norm penalty 
on e does not take into account the fact that neighboring pixels in y are likely to share the same 
label (background or foreground), which may lead to scattered pieces of foreground and background 
regions (Figure [3]). We therefore put an additional structured regularization term f2 on e, where 
the groups in Q are all the overlapping 3 x 3-squares on the image. A dataset with hand-segmented 
evaluation images is used to illustrate the effect of f20 For simplicity, we use a single regularization 
parameter, i.e., Ai = A2, chosen to maximize the number of pixels matching the ground truth. We 
consider p — 200 images with n = 57600 pixels (i.e., a resolution of 120 x 160, times 3 for the RGB 
channels). As shown in Figure [21 adding Q, improves the background subtraction results for the two 
tested images, by removing the scattered artifacts due to the lack of structural constraints of the 
^i-norm, which encodes neither spatial nor color consistency. 

4.3 Multi-Task Learning of Hierarchical Structures 

In |11) , Jenatton et al. have recently proposed to use a hierarchical structured norm to learn dictio- 
naries of natural image patches. Following their work, we seek to represent n signals {y 1 , . . . ,y™} 
of dimension m as sparse linear combinations of elements from a dictionary X = [x 1 , . . . ,x p ] in 
U>mxp This can be expressed for all i in [l;n] as y 1 « Xw 4 , for some sparse vector w 1 in W. In 
|11| . the dictionary elements are embedded in a predefined tree T, via a particular instance of the 
structured norm f2, which we refer to it as n tre e) and call Q the underlying set of groups. In this 
case, each signal y l admits a sparse decomposition in the form of a subtree of dictionary elements. 

Inspired by ideas from multi-task learning |16| . we propose to learn the tree structure T by 
pruning irrelevant parts of a larger initial tree 7o- We achieve this by using an additional regular- 
ization term fijoint across the different decompositions, so that subtrees of 7o will simultaneously 
be removed for all signals y\ In other words, the approach of [TT| is extended by the following 

'''http : / /research .microsoft . com/ en-us/um/people/ jckrumm/wallf lower /test images . htm 
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Figure 3: The original image y (column 1), the background (i.e., Xw) reconstructed by our method 
(column 2), and the foreground (i.e., the sparsity pattern of e as a mask on the original image) 
detected with l\ (column 3) and with (column 4). The rightmost column is another foreground 

found with CI, on a different image, with the same values of Ai, A2 as for the previous image. For the 
top left image, the percentage of pixels matching the ground truth is 98.8% with SI, 87.0% without. 
As for the bottom left image, the result is 93.8% with ft, 90.4% without (best seen in color). 



formulation: 
• If 



mm 

x.w n ^-~>L2 

i=l 



r - Xw*||| + A 1 fi t ree(w i ) + A 2 Jl joint (W) , s.t. ||x*|| 2 < 1, for all j in [l;p], (6) 



where W = [w 1 , . . . , w n ] is the matrix of decomposition coefficients in R pxn . The new regulariza- 
tion term operates on the rows of W and is defined as Slj mt(W) = X^ec? max ie[i;n] l w g|0 The 
overall penalty on W, which results from the combination of fitree and fijoint, is itself an instance 
of O with general overlapping groups, as defined in Eq @. 

To address problem ((S]), we use the same optimization scheme as i.e., alternating between X 
and W, fixing one variable while optimizing with respect to the other. The task we consider is the 
denoising of natural image patches, with the same dataset and protocol as [11]. We study whether 
learning the hierarchy of the dictionary elements improves the denoising performance, compared to 
standard sparse coding (i.e., when fitree is the ^i-norm and A 2 = 0) and the hierarchical dictionary 
learning of [TTj based on predefined trees (i.e., A 2 = 0). The dimensions of the training set — 50 000 
patches of size 8x8 for dictionaries with up to p = 400 elements — impose to handle extremely large 
graphs, with \E\ « | | « 4.10 7 . Since problem © is too large to be solved exactly sufficiently many 
times to select the regularization parameters (Ai, A 2 ) rigorously, we use the following heuristics: we 
optimize mostly with the currently pruned tree held fixed (i.e., A 2 =0), and only prune the tree (i.e., 
A 2 > 0) every few steps on a random subset of 10 000 patches. We consider the same hierarchies 
as in |llj . involving between 30 and 400 dictionary elements. The regularization parameter Ai 
is selected on the validation set of 25 000 patches, for both sparse coding (Flat) and hierarchical 
dictionary learning (Tree). Starting from the tree giving the best performance (in this case the 
largest one, see Figure 2]), we solve problem ([5]) following our heuristics, for increasing values of A 2 . 
As shown in Figure 21 there is a regime where our approach performs significantly better than the 
two other compared methods. The standard deviation of the noise is 0.2 (the pixels have values in 
[0, 1]); no significant improvements were observed for lower levels of noise. 



8 The simplified case where f2tree and Oj ; nt are the l\- and mixed ^i/£2-norms 1141 corresponds to 1311 . 
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Figure 4: Left: Hierarchy obtained by pruning a larger tree of 76 elements. Right: Mean square 
error versus dictionary size. The error bars represent two standard deviations, based on three runs. 

5 Conclusion 

We have presented a new optimization framework for solving sparse structured problems involving 
sums of £oo-norms of any (overlapping) groups of variables. Interestingly, this sheds new light on 
connections between sparse methods and the literature of network flow optimization. In particular, 
the proximal operator for the formulation we consider can be cast as a quadratic min-cost flow 
problem, for which we propose an efficient and simple algorithm. This allows the use of accelerated 
gradient methods. Several experiments demonstrate that our algorithm can be applied to a wide 
class of learning problems, which have not been addressed before within sparse methods. 
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A Equivalence to Canonical Graphs 

Formally, the notion of equivalence between graphs can be summarized by the following lemma: 
Lemma 2 (Equivalence to canonical graphs.) 

Let G = (V, E, s, t) be the canonical graph corresponding to a group structure Q with weights {jj g ) g ^g. 
Let G' = (V, E' ', s,t) be a graph sharing the same set of vertices, source and sink as G, but with a 
different arc set E' . We say that G' is equivalent to G if and only if the following conditions hold: 

• Arcs of E' outgoing from the source are the same as in E, with the same costs and capacities. 

• Arcs of E' going to the sink are the same as in E, with the same costs and capacities. 

• For every arc (g,j) in E, with (g,j) in V gr x V u , there exists a unique path in E' from g to j 
with zero costs and infinite capacities on every arc of the path. 

• Conversely, if there exists a path in E' between a vertex g in V gr and a vertex j in V u , then 
there exists an arc (g,j) in E. 

Then, the cost of the optimal min-cost flow on G and G' are the same. Moreover, the values of the 
optimal flow on the arcs (j,t), j inV u , are the same on G and G' . 

Proof. We first notice that on both G and G', the cost of a flow on the graph only depends on 
the flow on the arcs (j, t), j in V u , which we have denoted by £ in E. 

We will prove that finding a feasible flow 1 on G with a cost c(tt) is equivalent to finding a 
feasible flow tt' on G' with the same cost c(tt) = c(ir'). We now use the concept of path flow, which 
is a flow vector in G carrying the same positive value on every arc of a directed path between two 
nodes of G. It intuitively corresponds to sending a positive amount of flow along a path of the 
graph. 

According to the definition of graph equivalence introduced in the Lemma, it is easy to show 
that there is a bijection between the arcs in E, and the paths in E' with positive capacities on every 
arc. Given now a feasible flow tt in G, we build a feasible flow tt' on G' which is a sum of path 
flows. More precisely, for every arc a in E, we consider its equivalent path in E' , with a path flow 
carrying the same amount of flow as a. Therefore, each arc a' in E' has a total amount of flow that 
is equal to the sum of the flows carried by the path flows going over a'. It is also easy to show that 
this construction builds a flow on G' (capacity and conservation constraints are satisfied) and that 
this flow tt' has the same cost as tt, that is, c(tt) = c(tt'). 

Conversely, given a flow tt' on G' , we use a classical path flow decomposition (see Proposition 
1.1 in [55]), saying that there exists a decomposition of tt' as a sum of path flows in E' . Using the 
bijection described above, we know that each path in the previous sums corresponds to a unique 
arc in E. We now build a flow tt in G, by associating to each path flow in the decomposition of tt' , 
an arc in E carrying the same amount of flow. The flow of every other arc in E is set to zero. It is 
also easy to show that this builds a valid flow in G that has the same cost as tt'. ■ 



B Convergence Analysis 

We show in this section the correctness of Algorithm [1] for computing the proximal operator, and 
of Algorithm [5] for computing the dual norm fi*. 
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B.l Computation of the Proximal Operator 

We now prove that our algorithm converges and that it finds the optimal solution of the proximal 
problem. This requires that we introduce the optimality conditions for problem Q derived in |11| , 
since our convergence proof essentially checks that these conditions are satisfied upon termination 
of the algorithm. 

Lemma 3 (Optimality conditions of the problem (|4]), [11]) The primal- dual variables (w, £) 
are respectively solutions of the primal |3J] and dual problems if and only if the dual variable £ 
is feasible for the problem ^ and 

v g { wj€» = iKIUUHi ™d W\\i = Xv g , 
' \ or vf g = 0. 

Note that these optimality conditions provide an intuitive view of our min-cost flow problem. 
Solving the min-cost flow problem is equivalent to sending the maximum amount of flow in the 
graph under the capacity constraints, while respecting the rule that the flow outgoing from a group 
g should always be directed to the variables Uj with maximum residual Uj — X^eS^j - 

Before proving the convergence and correctness of our algorithm, we also recall classical prop- 
erties of the min capacity cuts, which we intensively use in the proofs of this paper. The procedure 
computeFlow of our algorithm finds a minimum (s,t)-cut of a graph G — (V,E,s,t), dividing the 
set V into two disjoint parts V + and V~. V + is by construction the sets of nodes in V such that 
there exists a non-saturating path from s to V, while all the paths from s to V~ are saturated. 
Conversely, arcs from V + to t are all saturated, whereas there can be non-saturated arcs from V~ 
to t. Moreover, the following properties hold 

• There is no arc going from V + to V~ . Otherwise the value of the cut would be infinite. (Arcs 
inside V have infinite capacity by construction of our graph). 

• There is no flow going from V~ to V + (see properties of the minimum (s, i)-cut |26|). 

• The cut goes through all arcs going from V + to t, and all arcs going from s to V~. 

All these properties are illustrated on Figure [5] 

Recall that we assume (cf. Section I3.1[) that the scalars Uj are all non negative, and that we 
add non-negativity constraints on With the optimality conditions of Lemma [5] in hand, we can 
show our first convergence result. 

Proposition 1 (Convergence of Algorithm [T|) 

Algorithm]]] converges in a finite and polynomial number of operations. 

Proof. Our algorithm splits recursively the graph into disjoints parts and processes each part 
recursively. The processing of one part requires an orthogonal projection onto an £i-ball and a 
max-flow algorithm, which can both be computed in polynomial time. To prove that the procedure 
converges, it is sufficient to show that when the procedure computeFlow is called for a graph 
(V,E,s,t) and computes a cut (V + ,V~), then the components V + and V~ are both non-empty. 

Suppose for instance that V~= 0. In this case, the capacity of the min-cut is equal to J2jev Tj> 
and the value of the max-flow is J2j^v £j- Using the classical max-flow/min-cut theorem [35], we 
have equality between these two terms. Since, by definition of both 7 and £, we have for all j in V u , 
ij < 7 j ! we obtain a contradiction with the existence of j in V u such that £ • ^ 7^ . 

Conversely, suppose now that V + = 0. Then, the value of the max-flow is still J2jev C/> an< ^ 
the value of the min-cut is A^ jel/ rj g . Using again the max-flow/min-cut theorem, we have that 
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Figure 5: Cut computed by our algorithm. V + = V+ U V g +, with = {g}, V+ = {1,2}, and 
V~ = V~ U V grl with Vz. — {h}, V~ = {3}. Arcs going from s to V~ are saturated, as well as arcs 
going from V + to t. Saturated arcs are in bold. Arcs with zero flow are dotted. 

EjeVuC' = X T ig ev gr 7 l9- Moreover, by definition of 7, we also have Ej e y„ £j < J2 3 ev u *Tj ^ 
A J2 g ev Vg> leading to a contradiction with the existence of j in V u such that ^ ^ 7^. This proof 
holds for any graph that is equivalent to the canonical one. ■ 

After proving the convergence, we prove that the algorithm is correct with the next proposition. 

Proposition 2 (Correctness of Algorithm [T]) 

Algorithm^ solves the proximal problem of Eq. f3|). 

Proof. For a group structure Q, we first prove the correctness of our algorithm if the graph used 
is its associated canonical graph that we denote Go = (Vb, Eo, s, t). We proceed by induction on 
the number of nodes of the graph. The induction hypothesis 'H(fc) is the following: 

For all canonical graphs G — (V = V u U V gr ,E,s,t) associated with a group structure Gv with 
weights (%) g ec?v suc h that \V\ < k, computeFlow{V, E) solves the following optimization problem: 

min \^ - E sA - V -9 e V- E % ^ A % and = °> V * 9- ^ 

Since £/v& = G> it is sufficient to show that %(|Vb|) to prove the proposition. 

We initialize the induction by "H(2), corresponding to the simplest canonical graph, for which 
\V gr \ — \V U \ — 1). Simple algebra shows that Tl(2) is indeed correct. 

We now suppose that H(k') is true for all k 1 < k and consider a graph G — (V, E, s, t), \V\ = k. 
The first step of the algorithm computes the variable (7 3 -)j'eV« by a projection on the ^i-ball. This 
is itself an instance of the dual formulation of Eq. (U) in a simple case, with one group containing all 
variables. We can therefore use Lemma|3]to characterize the optimality of ('Yj)jev u : which yields 

f £j 6 v„ K ~ 7j)7j = ( naax jeK |u* - i j 1) E je v u Ij and £j 6 y„ 1j = X £<?ev 9r Vg, (g) 
\ or u j -"f j =Q, Vj eV u . { ' 
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The algorithm then computes a max-flow, using the scalars y ■ as capacities, and we now have two 
possible situations: 

1. If £j = ~fj for all j in V u , the algorithm stops; we write Wj = Uj — £^ for j in V u , and using 
Eq. ([5]), we obtain 

SjeK w iC/ = ( max jev„ |Wj-|) EjeK <j and £j€V« £f = A E se v, r ^ 
or Wj = 0, Vj £ K, 1 ^ 

We can rewrite the condition above as 



E E w ^? = E (p* wE < 



Since all the quantities in the previous sum are positive, this can only hold if for all g G V gr , 

E w ^f = te a , x i w ii)E^- 

jev u " jev u 

Moreover, by definition of the max flow and the optimality conditions, we have 

V 5 e V gr , ]T €' ^ and E = A E 

jev tt jev u g ev gr 

which leads to 

By Lemma [3J we have shown that the problem ([7J is solved. 

2. Let us now consider the case where there exists j in V u such that £■ ^ -y -. The algorithm 
splits the vertex set V into two parts V + and , which we have proven to be non-empty in 
the proof of Proposition [T] The next step of the algorithm removes all edges between V + and 
V~ (see Figure [5]). Processing (V + ,E + ) and (V~,E~) independently, it updates the value 
of the flow matrix j G V u , g € V gr , and the corresponding flow vector ; j G V u . As for 
V, we denote by V+ = V+ n K, K" - ^~ n K and 7+ = n Vg r , V~ — V~ f) V gr . 

Then, we notice that (V + , E + . s, t) and (V - , E~~ , s, t) are respective canonical graphs for the 
group structures G v + = {g n V+ \ g G K,,.}, and C/ v - = {g n | g G V" gr }. 

Writing Wj = Uj — £ ■ for j in V^, and using the induction hypotheses %(|y + |) and 

we now have the following optimality conditions deriving from Lemma [3] applied on Eq. ([7]) 

respectively for the graphs (V + ,E + ) and (V~,E~): 

v s ey;,.^ 9 ne, j ™J'& =J™A°°E,e 9 ^ ^ E jes ^f = A%, (10) 

and 

= ll w s'll°oE,w£? and E,e 9 '^? = A? ?: 



V.9 G 7-,</ 4 ff n KT, [ or w ; = j 9 ~ 

We will now combine Eq. (fTU| and Eq. (jTTJ) into optimality conditions for Eq. ([7]). We first 
notice that g D = g since there are no arcs between V + and in E (see the properties 
of the cuts discussed before this proposition). It is therefore possible to replace g 1 by g in 
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Eq. (fTP|) . We will show that it is possible to do the same in Eq. (fTTj) , so that combining these 
two equations yield the optimality conditions of Eq. Q . 

More precisely, we will show that for all g g V~, and j g g n V^ 1 ", |wj| < max ;ejnV ,- |wj|, 
in which case g' can be replaced by g in Eq. (Ilip . This result is relatively intuitive: (s, V^ + ) 
and (V~,t) being an (s,t)-cut, all arcs between s and are saturated, while there are 
unsaturated arcs between s and V + ; one therefore expects the residuals Uj — £ ■ to decrease 
on the V + side, while increasing on the V~ side. The proof is nonetheless a bit technical. 

Let us show first that for all g in Vi, Hw^H^ < maxj e y u |uj — 7j|- We split the set V + into 
disjoint parts: 

V g V - {fJ e V+ s.t. W^gW^ < max \uj - 7^}, 

V++ 4 {j g F+ s.t. 3 5 g j g g}, 

V g Y =V+\V++ = { 9 EV+ s.t. |K|| oo >max|u,- 7j |}, 

v+- = v + \ v ++ 

As previously, we denote V + ~ = V+~UV+~ and V ++ =V+ + U V+ + . We want to show that 
is necessarily empty. We reason by contradiction and assume that ^ 0. 

According to the definition of the different sets above, we observe that no arcs are going from 
V ++ to , that is, for all g in V+ + , g D = 0. We observe as well that the flow from 
V+,~ to V+ + is the null flow, because optimality conditions (TIU]) imply that for a group g only 
nodes j g g such that wj = HwgHoo receive some flow, which excludes nodes in + provided 
^ 0; Combining this fact and the inequality J2 g ey + A % — 12jev + 1j (which is a direct 
consequence of the minimum (s,i)-cut), we have as well 

E A %> E -rj- 

seVgt" jev+~ 

Let j g , if £j ^ then for some g £ Vg%~ such that j receives some flow from g, which 
from the optimality conditions (TlT)|) implies Wj = Hw^H^; by definition of , Hw^Hoo > 
u 3 — 7^ . But since at the optimum, Wj = Uj — £ • , this implies that £^ < 7^ , and in turn that 
E je v+- Cj = A E 3G v+- Finally, 

A E % = E C < E ^ 

and this is a contradiction. 

We now have that for all g in V+. [[w^H^ < maxj e v u |uj — 7 -|. The proof showing that for 
all g in V~, HwgH^ > maxjgy^ — 7y|, uses the same kind of decomposition for V~ , and 
follows along similar arguments. We will therefore not detail it. 

To recap, we have shown that for all g £ and j g g PI |wj| < max ie9nl/ - |w;|. Since 
there is no flow from V~ to V + , i.e., = for g in and j in we can now replace the 
definition of g' in Eq. (TlTj) by g' = g D V u , the combination of Eq. (|10[) and Eq. (fTTj) gives us 
optimality conditions for Eq. Q. 

The proposition being proved for the canonical graph, we extend it now for an equivalent graph 
in the sense of Lemma [H First, we observe that the algorithm gives the same values of 7 for two 
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equivalent graphs. Then, it is easy to see that the value £ given by the max-flow, and the chosen 
(s,i)-cut is the same, which is enough to conclude that the algorithm performs the same steps for 
two equivalent graphs. ■ 



B.2 Computation of the Dual Norm Q* 

Similarly to the proximal operator, the computation of dual norm fi* can itself shown to solve 
another network flow problem, based on the following variational formulation, which extends a 
previous result from [5]: 

Lemma 4 (Dual formulation of the dual-norm il*.) 

Let KE« p . We have 

0*(k)= min t s.t. V£ 9 = k. and \/g S G, U a \\i < Tn q with & = 0ifj£g. (12) 
5 geg 

Proof. By definition of f2*(/c), we have 

Q*(k) = max z T k. 
n(z)<i 

By introducing the primal variables {otg) g <zg € we can rewrite the previous maximization 

problem as 

Q*(k) = max k t z, s.t. VgEG, ||z g ||oo < a 9 , 

with the additional \G\ conic constraints Hz^H^ < a g . This primal problem is convex and satisfies 
Slater's conditions for generalized conic inequalities, which implies that strong duality holds [52] , 
We now consider the Lagrangian C defined as 

C{i,a g ,T, lg ,£) = K T z + T(l-Y,V g a g ) + Yl (?) (ll) ' 

geg geg VZg/ \W 

with the dual variables {r, (7 g ) 5e g,£} 6 M + xRl e l x]RP x l e l such that for all g € Q, = if j £ g 
and ||£ ff ||i < 7 g . The dual function is obtained by taking the derivatives of C with respect to the 
primal variables z and (a g ) g t=g and equating them to zero, which leads to 

••./' ii /'}• >>j -y ; ,.,:C: =o 

Vg € G, Trig ~j g =0. 
After simplifying the Lagrangian and flipping the sign of the dual problem then reduces to 

min T s t /Vj - !!•••• ■!>]■ *i -Egeg C ^d $ = if U g, 

€6Kpxisi, T \y g e G, U 9 \\l < TTJg, 

which is the desired result. ■ 
We now prove that Algorithm [5] is correct. 

Proposition 3 (Convergence and correctness of Algorithm [2]) 

Algorithm^ computes the value of the dual norm of Eq. jlfy) in a finite and polynomial number of 
operations. 
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Proof. The convergence of the algorithm only requires to show that the cardinality of V in the 
different calls of the function computeFlow strictly decreases. Similar arguments to those used in 
the proof of Proposition Q] can show that each part of the cuts (V + , V~) are both non-empty. The 
algorithm thus requires a finite number of calls to a max-flow algorithm and converges in a finite 
and polynomial number of operations. 

Let us now prove that the algorithm is correct for a canonical graph. We proceed again by 
induction on the number of nodes of the graph. More precisely, we consider the induction hypothesis 
%'{k) defined as: 

for all canonical graphs G — (V, E, s, t) associated with a group structure Qy and such that \V\ < k. 
dualNormAux(V = V u U V gr , E) solves the following optimization problem: 

minr s.t. Vj € V u , Kj = £ and V 5 G V gr , £ $ < TT) g with = if j £ 9- (13) 

gev gr jeV„ 

We first initialize the induction by H(2) (i.e., with the simplest canonical graph, such that \V gr \ — 
\V U \ = 1). Simple algebra shows that H(2) is indeed correct. 

We next consider a canonical graph G = (V, E, s, t) such that \ V\ — k, and suppose that H'(k-l) 
is true. After the max-flow step, we have two possible cases to discuss: 

1. If fj = jj for all j in V u , the algorithm stops. We know that any scalar r such that the 
constraints of Eq. (fT3")) are all satisfied necessarily verifies J2 g ev Tr lg — ^2jev K v We nave 
indeed that ^2 g£V Tr\ g is the value of an (s, i)-cut in the graph, and ^2jev K J IS ^ e vame °f 
the max-flow, and the inequality follows from the max-flow/min-cut theorem |25) . This gives 
a lower-bound on r. Since this bound is reached, r is necessarily optimal. 

2. We now consider the case where there exists j in V u such that £^ ^ Kj, meaning that for the 
given value of t, the constraint set ofEq. (fl3|) is not feasible for ^, and that the value of r should 
necessarily increase. The algorithm splits the vertex set V into two non-empty parts V + and 
V~ and we remark that there are no arcs going from V + to V~, and no flow going from V~ to 
V + . Since the arcs going from s to V~ are saturated, we have that ^2 g£V - Tr\ g < X^ey- K j- 
Let us now consider t* the solution of Eq. ([TU1) . Using the induction hypothesis "H'dV^I), 
the algorithm computes a new value r' that solves Eq. (fTS)) when replacing V by V~ and 
this new value satisfies the following inequality 5Z ffg y- T'?y g > ^2j eV - Kj. The value of t' 
has therefore increased and the updated flow £ now satisfies the constraints of Eq. (|13[) and 
therefore r' > t* . Since there are no arcs going from V + to V - , t* is feasible for Eq. (fT3")) 
when replacing V by V~ and we have that t* > t' and then r' = r*. 

To prove that the result holds for any equivalent graph, similar arguments to those used in the 
proof of Proposition Q] can be exploited, showing that the algorithm computes the same values of r 
and same (s, t)-cuts at each step. ■ 



C Algorithm FISTA with duality gap 

In this section, we describe in details the algorithm FISTA [3] when applied to solve problem ([1]), 
with a duality gap as stopping criterion. 

Without loss of generality, let us assume we are looking for models of the form Xw, for some 
matrix X G M. nxp (typically, linear models where X is the data matrix of n observations). Thus, 
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we can consider the following primal problem 

min /(Xw) + AQ(w), (14) 

in place of ([1]). Based on Fenchel duality arguments |29| . 

/(Xw) + Afi(w) + /*(-«), for w G R p ,k G R™ and 0*(X t k) < A, 

is a duality gap for HU). where = sup z [z T M — /(z)] is the Fenchel conjugate of /. Given a 

primal variable w, a good dual candidate k can be obtained by looking at the conditions that have 
to be satisfied by the pair (w, k) at optimality |29| . In particular, the dual variable k is chosen to 
be 

k = -p-'V/fXw), with p = max{A- 1 rj*(X T V/(Xw)),l}. 

Consequently, computing the duality gap requires evaluating the dual norm f2*. We sum up the 
computation of the duality gap in Algorithm [3] 

Moreover, we refer to the proximal operator associated with AJ7 as prox[ A Q] . As a brief reminder, 
it is defined as the function that maps the vector u in W to the (unique, by strong convexity) solution 
ofEq. ©. 



Algorithm 3 FISTA procedure to solve problem (fl"4"|) . 

l: Inputs: initial wvq) G M. p , ft, A > 0, £ gap > (precision for the duality gap). 

2: Parameters: v > 1, Lq > 0. 

3: Outputs: solution w. 

4: Initialization: yn) = W( ), t\ = 1, k = 1. 

5: while { computeDualityGap(w(fc_!)) > e ga p} do 

6: Find the smallest integer Sk >0 such that 

7: /(prox [AO] (y (fc) )) < /(y (fc) ) + Aj k) V/(y (fc) ) + £||A (fc) ||i, 

8: with L = L k v Sk and A (fe) = y (fc) -prox [AO] (y (fe) ). 

9: L k ^L k _ x v Sk . 

10: w (fe) prox [A fi] (y (fc) ). 

11: t fe+1 <- (1 + v /lT7f)/2. 

12: y ( fc+l) <-W( fc ) + ^-(w (fe) -W^r)). 
13: k 4- k+ 1. 

14: end while 

15: Return: w W( fe _!). 

Procedure computeDualityGap(w) 

1: K<--p- 1 V/(Xw), with p = max {A- 1 ^*(X T V/(Xw)),l}. 
2: Return: /(Xw) + AO(w) + /*(-k). 



D Additional Experimental Results 

D.l Speed comparison of Algorithm Q] with parametric max-flow algo- 
rithms 

As shown in |19j . min-cost flow problems, and in particular, the dual problem of ([3]), can be reduced 
to a specific parametric max-flow problem. We thus compare our approach (ProxFlow) with the 
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efficient parametric max-flow algorithm proposed by Gallo. Grigoriadis, and Tar- jan |22) and a 
simplified version of the latter proposed by Babenko and Goldberg in [53|. We refer to these two 
algorithms as GGT and SIMP respectively. The benchmark is established on the same datasets 
as those already used in the experimental section of the paper, namely: (1) three datasets built 
from overcomplete bases of discrete cosine transforms (DCT), with respectively 10 4 , 10 5 and 10 6 
variables, and (2) images used for the background subtraction task, composed of 57600 pixels. 
For GGT and SIMP, we use the paraF software which is a C++ parametric max-flow implementa- 
tion available at http://www.avglab.com/andrew/soft.html. Experiments were conducted on a 
single-core 2.33 Ghz. 

We report in the following table the execution time in seconds of each algorithm, as well as the 
statistics of the corresponding problems: 



Number of variables p 


10 000 


100 000 


1000 000 


57 600 


\v\ 


20 000 


200 000 


2 000 000 


75 600 


\E\ 


110 000 


500 000 


11000 000 


579 632 


ProxFlow (in sec.) 


0.4 


3.1 


113.0 


1.7 


GGT (in sec.) 


2.4 


26.0 


525.0 


16.7 


SIMP (in sec.) 


1.2 


13.1 


284.0 


8.31 



Although we provide the speed comparison for a single value of A (the one used in the corresponding 
experiments of the paper), we observed that our approach consistently outperforms GGT and SIMP 
for values of A corresponding to different regularization regimes. 
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